This document is required to indicate where various requirements can be found within your Final Project Report Rmd. You must indicate line numbers as they appear in your final Rmd document accompanying each of the following required tasks. Points will be deducted if line numbers are missing or differ significantly from the submitted Final Rmd document.
Description: (1) Analysis includes at least two different data sources. (2) Primary data source may NOT be loaded from an R package–though supporting data may. (3) Access to all data sources is contained within the analysis. (4) Imported data is inspected at beginning of analysis using one or more R functions: e.g., str, glimpse, head, tail, names, nrow, etc.
rm(list = ls())
library(rvest)
library(lubridate)
## Warning: package 'lubridate' was built under R version 4.3.2
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.2
## Warning: package 'stringr' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ readr::guess_encoding() masks rvest::guess_encoding()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(DataComputing)
## Registered S3 method overwritten by 'mosaic':
## method from
## fortify.SpatialPolygonsDataFrame ggplot2
sanfran <- read.csv("SanFranPublicBathrooms.csv", header = TRUE)
head(sanfran)
library(remotes)
install_github("yonghah/esri2sf")
## Skipping install of 'esri2sf' from a github remote, the SHA1 (f47eda53) has not changed since last install.
## Use `force = TRUE` to force installation
library(esri2sf)
url <- "https://services.arcgis.com/QPnoxtBFXm6yYtyc/ArcGIS/rest/services/SanFranciscoPoopMap_WFL1/FeatureServer/2"
poop <- esri2sf(url)  # no objectIds filter: retrieve every feature in the layer
## Layer Type: Feature Layer
## Geometry Type: esriGeometryPoint
## Service Coordinate Reference System: 3857
## Output Coordinate Reference System: 4326
str(poop)
## Classes 'sf' and 'data.frame': 39949 obs. of 15 variables:
## $ OBJECTID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Date : int 43466 43467 43599 43468 40177 40177 40177 43469 40177 43473 ...
## $ RequestType : chr "Human or Animal Waste" "Human or Animal Waste" "Human or Animal Waste" "Human or Animal Waste" ...
## $ Address : chr "2048 BRYANT ST, SAN FRANCISCO, CA, 94110" "18 ADAIR ST, SAN FRANCISCO, CA, 94103" "400 GOLDEN GATE AVE, SAN FRANCISCO, CA, 94102" "15 SHARON ST, SAN FRANCISCO, CA, 94114" ...
## $ street : chr "BRYANT ST" "ADAIR ST" "GOLDEN GATE AVE" "SHARON ST" ...
## $ supervisor : int 9 9 6 8 3 3 3 9 6 3 ...
## $ Neighborhood : chr "Mission" "Mission" "Civic Center" "Mission Dolores" ...
## $ policedistrict: chr "MISSION" "MISSION" "TENDERLOIN" "MISSION" ...
## $ latitude : num 37.8 37.8 37.8 37.8 37.8 ...
## $ longitude : num -122 -122 -122 -122 -122 ...
## $ source : chr "Mobile/Open311" "Mobile/Open311" "Mobile/Open311" "Mobile/Open311" ...
## $ URL : chr "http://mobile311.sfgov.org/reports/10302217/photos" "http://mobile311.sfgov.org/reports/10304991/photos" "http://mobile311.sfgov.org/reports/10858769/photos" "http://mobile311.sfgov.org/reports/10311333/photos" ...
## $ CurrentPD : int 3 3 5 3 6 6 6 3 5 4 ...
## $ Neighborhoods : int 53 53 21 37 106 106 106 53 32 50 ...
## $ geoms :sfc_POINT of length 39949; first list element: 'XY' Named num -122.4 37.8
## ..- attr(*, "names")= chr [1:2] "x" "y"
## - attr(*, "sf_column")= chr "geoms"
## - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA NA ...
## ..- attr(*, "names")= chr [1:14] "OBJECTID" "Date" "RequestType" "Address" ...
head(sanfran)
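The `Date` column shown in `str(poop)` above stores integers such as 43466 rather than calendar dates; these look like Excel-style serial day numbers. A minimal sanity check, assuming the conventional 1899-12-30 origin that R uses for Excel serials:

```r
# Assumption: the integer codes in poop$Date are Excel serial day numbers,
# i.e. days elapsed since 1899-12-30
serial_example <- 43466
as.Date(serial_example, origin = "1899-12-30")
## [1] "2019-01-01"
```

If the assumption holds, the earliest codes in the table fall in early 2019, which is consistent with a 311-style reporting feed.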
Description: Students need not use every function and method introduced in STAT 184, but clear demonstration of proficiency should include proper use of 5 out of the following 8 topics from class: (+) various data verbs for general data wrangling like filter, mutate, summarise, arrange, group_by, etc. (+) joins for multiple data tables. (+) spread & gather to stack/unstack variables (+) regular expressions (+) reduction and/or transformation functions like mean, sum, max, min, n(), rank, pmin, etc. (+) user-defined functions (+) loops and control flow (+) machine learning
library(sf)
## Warning: package 'sf' was built under R version 4.3.2
## Linking to GEOS 3.11.2, GDAL 3.7.2, PROJ 9.3.0; sf_use_s2() is TRUE
# Drop columns we will not use and strip the sf geometry for plain-data-frame wrangling
wrangled <-
  poop %>%
  select(-c(URL, RequestType, source)) %>%
  st_drop_geometry()

# sanfran comes from read.csv(), so it is already a plain data frame with no geometry
sanfranBR <-
  sanfran %>%
  select(-c(uid, water_fountain, resource_type, bottle_filler, jug_filler,
            dog_fountain, notes, data_source, data_as_of, data_loaded_at))
# Join reports to restrooms by neighborhood; neighborhood names repeat in both
# tables, so we declare the many-to-many relationship explicitly
poopAtRestroom <-
  left_join(wrangled, sanfranBR,
            by = c("Neighborhood" = "analysis_neighborhood"),
            relationship = "many-to-many")
head(poopAtRestroom)
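Because neighborhood names repeat in both tables, the join is many-to-many: every report row pairs with every restroom row in the same neighborhood. A toy sketch with hypothetical mini-tables shows the row multiplication:

```r
library(dplyr)

# Hypothetical mini-tables: neighborhood names repeat on both sides
reports   <- data.frame(Neighborhood = c("Mission", "Mission", "Civic Center"))
restrooms <- data.frame(analysis_neighborhood = c("Mission", "Mission"),
                        restroom_id = c("r1", "r2"))

combined <- left_join(reports, restrooms,
                      by = c("Neighborhood" = "analysis_neighborhood"),
                      relationship = "many-to-many")

# 2 Mission reports x 2 Mission restrooms = 4 rows, plus the unmatched
# Civic Center report kept by the left join = 5 rows total
nrow(combined)
## [1] 5
```

This is why the joined table is much longer than either input, and why per-neighborhood counts should be computed before (or instead of) such a join when row counts matter.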
# Return the rows of `frame` whose street value ends with `street_ending`
street_ending_finder <- function(frame, street_col, street_ending) {
  pattern <- paste0(street_ending, "$")  # anchor the ending to the end of the string
  Matches <- frame %>%
    filter(grepl(pattern, street_col, ignore.case = TRUE))
  return(Matches)
}

# Search the `street` column: full addresses end in a ZIP code, so "ST$" would never match them
Matches <- street_ending_finder(poop, poop$street, "ST")
Matches
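The `$` anchor in the pattern is what makes this a street-ending search rather than a substring search. A small illustration with made-up street names:

```r
streets <- c("BRYANT ST", "STANYAN AVE", "MARKET ST")

# Without the anchor, "ST" also matches the substring inside STANYAN
grepl("ST", streets, ignore.case = TRUE)
## [1] TRUE TRUE TRUE

# Anchored to the end of the string, only true "ST" endings match
grepl("ST$", streets, ignore.case = TRUE)
## [1]  TRUE FALSE  TRUE
```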
x <- wrangled %>%
  select(Neighborhood) %>%
  group_by(Neighborhood) %>%
  summarise(poop_count = n())

y <- sanfran %>%
  select(analysis_neighborhood) %>%
  group_by(analysis_neighborhood) %>%
  summarise(restroom_count = n())

joined_table <- x %>%
  full_join(y, join_by(Neighborhood == analysis_neighborhood))

# Drop the rows with no neighborhood name
joined_table <- joined_table %>% filter(!is.na(Neighborhood))
joined_table
# The Date field holds Excel-style serial day numbers (days since 1899-12-30),
# so convert them directly to Date objects
convert_num_to_date <- function(serial) {
  as.Date(serial, origin = "1899-12-30")
}

poop2 <- wrangled %>%
  mutate(dayReported = convert_num_to_date(Date))
poop2 %>%
  group_by(dayReported, supervisor) %>%
  summarise(count = n()) %>%
  pivot_wider(names_from = supervisor, values_from = count)
## `summarise()` has grouped output by 'dayReported'. You can override using the
## `.groups` argument.
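`pivot_wider` unstacks the grouped summary: one row per day, one column per supervisor district, with `NA` for day/supervisor combinations that had no reports. A tiny sketch with made-up counts:

```r
library(tidyr)

# Hypothetical long-format counts
long <- data.frame(
  dayReported = c("2019-01-01", "2019-01-01", "2019-01-02"),
  supervisor  = c(3, 6, 3),
  count       = c(5, 2, 7)
)

wide <- pivot_wider(long, names_from = supervisor, values_from = count)
wide
# Day 2019-01-02 had no supervisor-6 reports, so that cell is NA
```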
# Replace every remaining NA count in joined_table with 0, cell by cell
for (i in seq_len(nrow(joined_table))) {
  for (j in seq_len(ncol(joined_table))) {
    if (is.na(joined_table[i, j])) {
      joined_table[i, j] <- 0
    }
  }
}
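The nested loop above makes the control flow explicit, which is one of the rubric topics; for reference, base R can do the same NA-to-zero replacement in a single vectorized step. A toy sketch with made-up counts:

```r
# Toy frame with scattered NAs (hypothetical values)
df <- data.frame(poop_count = c(12, NA, 7), restroom_count = c(NA, 3, 1))

# Logical-matrix indexing replaces every NA in the data frame at once
df[is.na(df)] <- 0
df
```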
We want to explore whether the number of public restrooms in a neighborhood is associated with the number of waste reports in that neighborhood. We will use simple linear regression to explore this.
The setup:
\(poop\_count_i \sim N(b_0 + b_1 \times restroom\_count_i, \sigma^2)\)
We test the following hypotheses:
\[\begin{align*} H_0 &: b_1 = 0 \\ H_a &: b_1 \neq 0 \end{align*}\]
Following the approach taught in STAT 415, we write a negative log-likelihood function.
# Negative log-likelihood for simple linear regression;
# theta = (b0, b1, sigma), where sigma is the error standard deviation
nll.regression <- function(theta, poop_count, restroom_count) {
  b0 <- theta[1]
  b1 <- theta[2]
  sigma <- theta[3]
  mu <- b0 + b1 * restroom_count
  -sum(dnorm(poop_count, mean = mu, sd = sigma, log = TRUE))
}
Here we use the `optim` function, which will estimate the MLEs of \(b_0\), \(b_1\), and \(\sigma\):
out <- optim(c(0, 0, 1), nll.regression,
             poop_count = joined_table$poop_count,
             restroom_count = joined_table$restroom_count,
             hessian = TRUE)
b1 <- out$par[2]
I <- out$hessian
se <- sqrt(diag(solve(I)))  # standard errors from the inverse observed information
Calculate the Wald test statistic and its two-sided p-value \(P(|Z| > |z^*|)\), comparing it against \(\alpha = 0.05\):
Z <- (b1 - 0) / se[2]
p_val <- 2 * pnorm(-abs(Z))  # two-sided p-value: use the normal CDF (pnorm), not the density (dnorm)
p_val
Since our p-value is less than 0.05, we reject the null hypothesis and conclude that the number of public restrooms in a neighborhood is associated with the number of waste reports in that neighborhood.
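As a sanity check on the hand-rolled MLE and Wald test, R's built-in `lm` fits the same model and reports a t-test on the slope that should lead to the same conclusion for a sample of this size. A sketch on simulated data (hypothetical numbers, not the project data):

```r
set.seed(184)

# Simulated neighborhoods with a genuine positive slope (made-up parameters)
restroom_count <- rpois(40, lambda = 5)
poop_count <- 50 + 30 * restroom_count + rnorm(40, sd = 20)

fit <- lm(poop_count ~ restroom_count)

# Slope estimate and its two-sided p-value from the built-in t-test
coef(summary(fit))["restroom_count", c("Estimate", "Pr(>|t|)")]
```

For large samples the `lm` t-test and the Wald test based on the MLE standard errors are asymptotically equivalent, so agreement between the two is a useful check on the `optim` setup.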
Description: Students need not use every function and method introduced in STAT 184, but clear demonstration of proficiency should include a range of useful data visualizations that are (1) relevant to the stated research question for the analysis, (2) include at least one effective display of many–at least 3–variables, and (3) include 3 of the following 5 visualization techniques learned in STAT 184: (+) use of multiple geoms such as points, density, lines, segments, boxplots, bar charts, histograms, etc (+) use of multiple aesthetics–not necessarily all in the same graph–such as color, size, shape, x/y position, facets, etc (+) layered graphics such as points and accompanying smoother, points and accompanying boxplots, overlaid density distributions, etc (+) decision tree and/or dendrogram displaying machine learning model results
.Rmd Line number(s) for use of multiple different geoms:
241,242,265,266
.Rmd Line number(s) for use of multiple aesthetics:
242,265,266
.Rmd Line number(s) for use of layered graphics:
241+242,265+266
.Rmd Line number(s) for use of decision tree or dendrogram results:
b0 <- out$par[1]
b1 <- out$par[2]

# Fitted regression line from the MLE estimates
reg_func <- function(b0, b1, x) {
  b0 + b1 * x
}

# Set color outside aes(): inside aes(), "red" becomes a legend label, not a color
ggplot(joined_table, aes(x = restroom_count, y = poop_count)) +
  geom_point() +
  geom_line(aes(y = reg_func(b0, b1, restroom_count)), color = "red")
# Small simulated example with four hypothetical neighborhoods
wrangledA <- data.frame(Neighborhood = sample(c("A", "B", "C", "D"), 100, replace = TRUE))
sanfranA <- data.frame(analysis_neighborhood = sample(c("A", "B", "C", "D"), 120, replace = TRUE))

x <- wrangledA %>%
  select(Neighborhood) %>%
  group_by(Neighborhood) %>%
  summarise(poop_count = n())

y <- sanfranA %>%
  select(analysis_neighborhood) %>%
  group_by(analysis_neighborhood) %>%
  summarise(restroom_count = n())

joined_table <- x %>%
  full_join(y, by = c("Neighborhood" = "analysis_neighborhood"))
# Pivot to long format so one geom_col() can dodge the two measures side by side;
# two separate geom_bar() layers with position = "dodge" would simply overlap
joined_table %>%
  pivot_longer(cols = c(poop_count, restroom_count),
               names_to = "measure", values_to = "count") %>%
  ggplot(aes(x = Neighborhood, y = count, fill = measure)) +
  geom_col(position = "dodge") +
  labs(title = "Poop and Restroom Counts in San Francisco Neighborhoods",
       x = "Neighborhood",
       y = "Count",
       fill = "Legend") +
  scale_fill_manual(values = c(poop_count = "#FF5733", restroom_count = "#33B5E5"),
                    labels = c("Poop Count", "Restroom Count")) +
  theme_minimal() +
  theme(legend.position = "top", legend.title = element_blank()) +
  scale_y_continuous(expand = expansion(mult = c(0.1, 0.2)))
#devtools::install_github("rstudio/leaflet")
library(leaflet)
## Warning: package 'leaflet' was built under R version 4.3.2
Poop_Map <-
  leaflet() %>%
  addTiles() %>%
  addMarkers(clusterOptions = markerClusterOptions(), data = poop) %>%
  addCircleMarkers(radius = 2, color = "red", data = sanfran) %>%
  setView(-122.44, 37.76849, zoom = 12) %>%
  addLegend(position = "bottomleft",
            colors = c("red", "orange"),
            labels = c("Public Restrooms", "Poop Clusters (numbered)"),
            title = "Marker Categories")
## Assuming "longitude" and "latitude" are longitude and latitude, respectively
Poop_Map
All data visualizations must be relevant to the stated research question, and the report must include at least one effective display of many–at least 3–variables.
Code quality: Code formatting is consistent with Style
Guide Appendix of DataComputing eBook. Specifically, all code chunks
demonstrate proficiency with (1) meaningful object names (2) proper use
of white space especially with respect to infix operators, chain
operators, commas, brackets/parens, etc (3) use of <-
assignment operator throughout (4) use of meaningful comments.
Narrative quality: The narrative text (1) clearly states one research question that motivates the overall analysis, (2) explains reasoning for each significant step in the analysis and its relationship to the research question, (3) explains significant findings and conclusions as they relate to the research question, and (4) is completely free of errors in spelling and grammar
Overall Quality: (1) Submitted project shows significant effort to produce a high-quality and thoughtful analysis that showcases STAT 184 skills. (2) The project must be self-contained, such that the analysis can be entirely rerun without errors. (3) Analysis is coherent, well-organized, and free of extraneous content such as data dumps, unrelated graphs, and other content that is not overtly connected to the research question.
EXTRA CREDIT (1) Project is submitted as a self-contained GitHub Repo (2) project submission is a functioning github.io webpage generated for the project Repo. Note: a link to the GitHub Repo itself will be awarded partial credit, but does not itself qualify as a “webpage” of the analysis.